For this project I will explore the white wines dataset. It contains around 4,900 records with several quantitative variables and a quality variable that captures the expert opinion of each wine. I plan to explore which factors, if any, contribute to the quality of the wine.
## Number of observations in dataset:
## 4898
## Number of variables in dataset:
## 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
You can see from the chart that the quality ratings follow a roughly normal distribution, with the most common rating being 6, followed by 5 and then 7.
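As a quick check on the ratings described above, the sketch below rebuilds the quality tally from the per-rating counts listed in the `n` column of the grouped summary table later in this report, then ranks the ratings by frequency. The name `quality_counts` is my own. With the full data frame loaded, `table()` on the quality column should give the same tally.

```r
# Per-rating counts copied from the `n` column of the grouped summary table
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rank ratings from most to least common
ranked <- sort(quality_counts, decreasing = TRUE)
top_three <- names(ranked)[1:3]

# Count-weighted mean rating, comparable to the mean in the summary() output
ratings <- as.numeric(names(quality_counts))
mean_quality <- sum(ratings * quality_counts) / sum(quality_counts)

print(top_three)
print(round(mean_quality, 3))
```

The counts add up to the 4,898 observations reported at the top of the report.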
From the charts above we can see that most of the variables follow a roughly normal distribution, except for residual sugar. I cleaned the dataset by removing all residual sugar values above the 95th percentile. With the cleaned data, I applied three different scale transformations to better understand the distribution: first log base 10, then log base 2, then a square-root x scale.
The residual sugar histogram now looks closer to a normal distribution. One interesting thing is that the histogram has a bimodal look, with two distinct peaks.
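The trimming and rescaling steps described above can be sketched in base R as follows. This is only an illustration: the values are a synthetic right-skewed sample, not the wine data, and the names `residual_sugar`, `cutoff`, and `trimmed` are hypothetical. Whether the value sitting exactly on the cutoff is kept is also an assumption.

```r
set.seed(1)

# Synthetic, right-skewed stand-in for the residual sugar column (not real data)
residual_sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Keep values at or below the 95th percentile
# (keeping the boundary value itself is an assumption)
cutoff <- quantile(residual_sugar, probs = 0.95)
trimmed <- residual_sugar[residual_sugar <= cutoff]

# The three rescalings mentioned in the text
sugar_log10 <- log10(trimmed)
sugar_log2  <- log2(trimmed)
sugar_sqrt  <- sqrt(trimmed)

print(length(trimmed))
print(summary(sugar_log10))
```

Passing any of the three transformed vectors to `hist()` would draw the corresponding histogram. Transforming the axis of a plot instead of the data gives the same shape with the original units on the tick labels.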
For this dataset there are 4,898 observations with 11 quantitative variables. Those variables are pH, alcohol, fixed acidity, volatile acidity, citric acid, chlorides, free sulfur dioxide, total sulfur dioxide, density, residual sugar, and sulphates. There is also one subjective variable, quality, which provides a way to determine what factors into a better quality rating. The remaining column, X, is just a row index.
Personally, being a wine drinker, I’d love to determine what factors go into making a better quality wine. I want to find out which variables have a positive effect on quality and which ones have a negative effect on quality.
At this point in the investigation process, it is important to investigate the relationships between all variables. Any of the variables could affect the quality rating. At this point I suspect that pH and alcohol will have a big effect on quality.
At this point the only variable I have created is the bound sulfur dioxide variable. I created it by subtracting the free sulfur dioxide from the total sulfur dioxide. I have only created a basic histogram for it, but I plan on using it during the bivariate analysis.
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The only feature with an unusual distribution was residual sugar. Its distribution has two peaks, around 2 and 8, which leads me to believe there are two types of white wine in the dataset: those with a higher sugar content and those with a lower sugar content. I did have to clean up the dataset with regard to residual sugar. There was a major outlier at about 65.8, so I kept only the data points below the 95th percentile. After the data was cleaned up, it was easier to see the distribution.
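The bound sulfur dioxide variable mentioned earlier is a simple column difference. A minimal sketch, using only the first five rows of the two sulfur dioxide columns as they appear in the `str()` output at the top of this report; the column name `bound.sulfur.dioxide` is my own choice and may differ from the name used in the actual analysis.

```r
# First five rows of the two sulfur dioxide columns, copied from the str() output
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Bound SO2 is the part of the total that is not free
# (the column name is hypothetical)
wines$bound.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

print(wines)
```

Because free sulfur dioxide is part of the total, the derived column should never be negative. Checking that is a cheap sanity test on the raw data.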
This is the ggpairs plot for all of the variables. From this I decided that I needed the correlations in a different format so I could examine which variables I want to focus on.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
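From this we see that a few variables are strongly correlated with one another, specifically residual sugar, density, alcohol, and quality. The largest values are residual sugar with density (about 0.84), density with alcohol (about -0.78), and alcohol with quality (about 0.44).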
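One way to pick out the strongest pairs without scanning the whole matrix is to list each pair once and sort by absolute correlation. The sketch below does this for the four variables named above, using values rounded from the printed matrix. The names `r` and `ranked` are hypothetical. With the data loaded, `cor()` on the relevant columns would supply `r` directly.

```r
vars <- c("residual.sugar", "density", "alcohol", "quality")

# Correlations rounded from the matrix printed above
r <- matrix(c( 1.000,  0.839, -0.451, -0.098,
               0.839,  1.000, -0.780, -0.307,
              -0.451, -0.780,  1.000,  0.436,
              -0.098, -0.307,  0.436,  1.000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Take each pair once (upper triangle only), then sort by |r|
idx <- which(upper.tri(r), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx],
  stringsAsFactors = FALSE
)
ranked <- ranked[order(-abs(ranked$r)), ]

print(ranked, row.names = FALSE)
```

Sorting by absolute value keeps strong negative relationships, such as density with alcohol, near the top instead of at the bottom.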
Next I created a new data frame that groups the data based on the quality rating. I am looking for the mean of the alcohol, pH, Density, and Residual Sugar. We can then plot those to see how the mean is affected by the quality of the wine.
We can see a strong correlation between residual sugar and the density of the wine. This makes sense: the more sugar a wine contains, the denser it should be, since sugar is denser than water.
## # A tibble: 7 x 6
## quality alcohol_mean pH_mean density_mean residualSugar_mean n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 10.3 3.19 0.995 6.39 20
## 2 4 10.2 3.18 0.994 4.63 163
## 3 5 9.81 3.17 0.995 7.33 1457
## 4 6 10.6 3.19 0.994 6.44 2198
## 5 7 11.4 3.21 0.992 5.19 880
## 6 8 11.6 3.22 0.992 5.67 175
## 7 9 12.2 3.31 0.991 4.12 5
Next I reshaped the data set into a long format in order to calculate the mean of four variables by quality. I will use these means to see what trends exist in the data.
From these plots we can see that as the quality of the wine increases, so do the mean alcohol and mean pH, at least from a rating of 5 upward. Conversely, mean density falls as quality goes up, and mean residual sugar trends lower overall, though not steadily. The lowest and highest ratings contain very few wines (20 and 5), so their means are less reliable.
While exploring this data set and looking for relationships, one pattern that stuck out to me was that as the quality rating of the wine increases, so does the mean alcohol content. The same goes for mean pH: the higher the quality, the higher (less acidic) the pH. The other interesting relationship that showed up was that as the density of the wine decreases, the quality increases. This leads me to reason that the lighter the wine is, the better the quality rating it receives.
One of the more interesting things to me was the relationship between residual sugar and the density of the wine. This relationship stands to reason because sugar is denser than water.
The strongest relationship I found was between residual sugar and density. With an r value of about 0.84 in the correlation matrix above, this was the most strongly correlated pair. The second most fascinating relationship I found was between alcohol and quality. With an r value of about 0.44, this relationship shows us that the higher quality wines also tend to have a higher alcohol content.
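The grouping and reshaping steps can be sketched in base R as below. The six-row data frame is made up purely for illustration and only two of the four variables are included. The tibble shown above suggests the original analysis used dplyr, so treat this as an equivalent sketch, not the code that produced it.

```r
# Tiny made-up sample (not real wine records)
wines <- data.frame(
  quality = c(5, 5, 6, 6, 6, 7),
  alcohol = c(9.4, 9.8, 10.1, 10.5, 11.0, 12.0),
  pH      = c(3.10, 3.20, 3.15, 3.25, 3.20, 3.30)
)

# Wide summary: one row per quality rating, one mean column per variable
by_quality <- aggregate(cbind(alcohol, pH) ~ quality, data = wines, FUN = mean)

# Long format: one row per (quality, variable) pair, convenient for faceting
measure_cols <- c("alcohol", "pH")
long_means <- data.frame(
  quality  = rep(by_quality$quality, times = length(measure_cols)),
  variable = rep(measure_cols, each = nrow(by_quality)),
  mean     = unlist(by_quality[measure_cols], use.names = FALSE)
)

print(by_quality)
print(long_means)
```

The long layout puts every mean in a single column, which lets one plotting call draw a separate panel per variable.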
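For readers unfamiliar with the r value, the sketch below computes Pearson's correlation by hand and compares it with R's built-in `cor()`. It uses only the first five residual sugar and density values shown in the `str()` output, so the result will not match the dataset-wide figure. The point is the calculation, not the number.

```r
# First five values of each column, copied from the str() output
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5)
density        <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# Pearson's r: sum of cross-deviations divided by the
# square root of the product of the sums of squared deviations
dx <- residual_sugar - mean(residual_sugar)
dy <- density - mean(density)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_builtin <- cor(residual_sugar, density)

print(c(manual = r_manual, builtin = r_builtin))
```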
# Bin the integer quality score into three ordered groups: <= 5, 6 to 7, >= 8
cleanSugarWines$quality <- cut(cleanSugarWines$quality,
                               breaks = c(-Inf, 5, 7, Inf),
                               labels = c("low", "medium", "high"))
For the multivariate plots I will first break the data up into quality groups. I decided that a rating of 5 or less is low quality, 6 and 7 are medium quality, and 8 and 9 are high quality. The breakdown is 1,616 low quality wines, 3,052 medium quality wines, and 180 high quality wines.
From this chart we can see the correlations based on the quality of the wines. We see that the high quality wines have a lower residual sugar content than the medium and low quality wines. The correlation between density and residual sugar is slightly higher for the medium quality wines than for the other quality groups.
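To confirm where the bin edges fall, the same `cut()` call can be applied to each possible rating. Intervals are closed on the right by default, so a rating of exactly 5 lands in "low" and a rating of exactly 7 lands in "medium". The vector names here are illustrative.

```r
# Every rating that occurs in the data
ratings <- 3:9

# Same breaks and labels as the call above
groups <- cut(ratings,
              breaks = c(-Inf, 5, 7, Inf),
              labels = c("low", "medium", "high"))

print(data.frame(rating = ratings, group = groups))
print(table(groups))
```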
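Comparing the correlation across quality groups can also be done numerically by splitting the data and calling `cor()` within each group. The sketch below runs on simulated data with a built-in positive relationship, so the numbers say nothing about the real wines. The names `toy`, `quality_group`, and `group_r` are hypothetical.

```r
set.seed(42)
n <- 60

# Simulated stand-in data: density rises with residual sugar, plus noise
toy <- data.frame(
  quality_group  = factor(rep(c("low", "medium", "high"), each = n / 3),
                          levels = c("low", "medium", "high")),
  residual.sugar = runif(n, min = 1, max = 15)
)
toy$density <- 0.990 + 0.0004 * toy$residual.sugar + rnorm(n, sd = 0.0005)

# One Pearson correlation per quality group
group_r <- sapply(split(toy, toy$quality_group),
                  function(g) cor(g$residual.sugar, g$density))

print(round(group_r, 3))
```

With only 180 wines in the high quality group, small differences between group correlations in the real data should be read cautiously.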
From this chart we see that there is a negative correlation between residual sugar and alcohol percentage. It is interesting to me that the higher quality wines have a higher alcohol content vs the other qualities. This leads us to reason that as residual sugar increases the alcohol content of the wine goes down.
From this chart we see that there is a slightly stronger correlation between alcohol and pH value for the higher quality wines. It is also interesting to see that as the alcohol increases so does the pH of the wine. I did not know the relationship between alcohol and pH until I explored this data set.
From this chart we see that there is a strong negative correlation between density and alcohol. So it seems that as the density increases, the alcohol content decreases. The highest rated wines have the highest alcohol content as well as the lowest density.
Some of the relationships I discovered while exploring the multivariate plots were the relationship between alcohol and pH (as the pH increases, so does the alcohol) and the relationship between residual sugar and alcohol. It seems that the less residual sugar a wine has, the higher its alcohol content will be.
The most interesting interaction to me was the relationship between alcohol and density. I had no idea that the more alcohol a wine has, the less dense it will be. It also shows that the higher the alcohol content and the lower the density, the higher the quality rating tends to be. So a high alcohol, low density wine should be very favorable in terms of rating.
This chart shows a strong correlation between residual sugar and the density of the wine. This makes sense: the more sugar a wine contains, the denser it should be, since sugar is denser than water. It is important to understand this relationship, as it was the strongest correlation in the dataset at about 0.84.
These charts show us that as the quality of the wine increases, so do the mean alcohol and mean pH. Conversely, as quality goes up, the means of density and residual sugar tend to fall. These are four important variables to compare against quality. They suggest that the experts rating these wines prefer a wine that is less acidic and has a higher alcohol content.
This chart shows us that there is a negative correlation between residual sugar and alcohol percentage. It is interesting to me that the higher quality wines have a higher alcohol content vs the other qualities. This leads us to reason that as residual sugar increases the alcohol content of the wine goes down.
For this exploratory data analysis, I chose the white wines dataset because I love white wines. It was really interesting to be able to examine over 4,800 different white wines along with their chemical breakdowns. It gave me a better understanding of what makes a higher quality wine. Specifically, a higher quality wine will generally have these traits: higher alcohol content, lower residual sugar, lower density, and a higher pH value. To me this is important to know because it can help you buy better quality wines. The higher the quality, the more enjoyable the experience of drinking it.